Abstract

Markov decision processes (MDPs) are a well-studied framework for solving
sequential decision making problems under uncertainty. Exact methods for solving
MDPs based on dynamic programming, such as policy iteration and value iteration,
are effective on small problems. In problems with a large discrete state space or
with continuous state spaces, a compact representation is essential for providing
efficient approximate solutions to MDPs. Commonly used approximation algorithms
involve constructing basis functions for projecting the value function
onto a low-dimensional subspace, and building a factored or hierarchical graphical
model to decompose the transition and reward functions. However, hand-coding
a good compact representation for a given reinforcement learning (RL) task can
be quite difficult and time consuming. Recent approaches have attempted to
automatically discover efficient representations for RL.

In this thesis proposal, we discuss the problem of automatically constructing
structured kernels for kernel-based RL, a popular approach to learning
non-parametric approximations of the value function. We explore a space of kernel
structures that are built compositionally from base kernels using a context-free
grammar. We examine a greedy algorithm for searching over the structure space. To
demonstrate how the learned structure can represent and approximate the original
RL problem in terms of compactness and efficiency, we plan to evaluate our
method on a synthetic problem and compare it to other RL baselines.


1 Introduction 

This report considers sequential decision making problems where decisions can have both immediate 
and long-term effects. Each decision results in some immediate reward or benefit, but also affects 
the environment in which further decisions are to be made and thus affects the expected reward 
incurred in the future. The objective of the decision maker is to choose decision making policies 
optimally, that is, to maximize some long-term cumulative measure of rewards. Such an objective
is challenging mainly because of the tradeoff between immediate and future rewards. Markov decision
processes (MDPs) [32] provide a mathematical formalization of this tradeoff.

1.1 Markov Decision Process 

An MDP is defined mathematically as a tuple $(S, A, P, R)$, where

• $S$ is the finite set of all possible states that describe the context of the environment, also
called the state space;

• $A$ is the finite set of all actions the decision making agent can take;

• $P : S \times A \times S \to [0,1]$ is a transition function, a mapping specifying the probability
$P^a_{ss'}$ of going to state $s'$ when performing action $a$ in state $s$. An essential assumption
made in the MDP is that the dynamics of state evolution are Markovian, meaning that the
distribution of the next state is conditionally independent of the past, given the current
state.

• $R : S \times A \times S \to \mathbb{R}$ is a reward function. $R(s, a, s')$ describes the payoff or reward
obtained when the agent goes from state $s$ to state $s'$ as a result of executing action $a$. The
reward can be either positive or negative, representing a utility or a cost, respectively.

The optimality objective is to find a policy that maximizes some measure of the long-term
reward received. A (stationary) policy $\pi : S \to A$ is a mapping from states to actions, which specifies
an action to be taken in each state. The choice of action is independent of time and depends only
on the state. Given a policy $\pi$, we can define a value function $V^{\pi}(s)$ on the state space, which is the
expected long-run value an agent can expect to receive by choosing the actions dictated by the
policy. A policy $\pi_1$ is said to dominate another policy $\pi_2$ if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$ for every state $s \in S$,
and there exists $s_1 \in S$ such that $V^{\pi_1}(s_1) > V^{\pi_2}(s_1)$. A fundamental theorem [2] of MDPs states that there
exists a stationary policy $\pi^*$, called the optimal policy, that dominates or has equal value to all other
policies. The existence of such an optimal policy relies on the assumption that the expected long-term
reward, which is the objective function in the MDP, accumulates additively over time. That is
to say, at each state, the optimal policy ranks the actions based on the sum of the expected reward
of the current time step and the optimal expected rewards of all subsequent steps.

To ensure the value function is well defined, one can limit the MDP to a finite number of time steps. 
In this case, the summation over rewards incurred in subsequent time steps terminates after a finite 
number of terms N, called the horizon, and the corresponding MDP is called a finite horizon MDP. 
The value of a policy $\pi$, starting from an initial state $s_0 = s$, is

    $$V^{\pi}_N(s) = \mathbb{E}\Big[ R(s_N) + \sum_{k=0}^{N-1} R(s_k, \pi(s_k), s_{k+1}) \;\Big|\; s_0 = s \Big] \qquad (1)$$

where $R(s_N)$ is a terminal reward for ending up in the final state $s_N$, and the expectation is
taken with respect to the probability distribution of the Markov chain $\{s_0, s_1, \dots, s_N\}$ starting at
the initial state $s$, with transition probabilities $P^{\pi(s_k)}_{s_k s_{k+1}}$. The optimal value function and the
optimal policy are denoted by $V^*_N(s)$ and $\pi^*(s)$, respectively; that is,

    $$V^*_N(s) = \max_{\pi} V^{\pi}_N(s) \qquad (2)$$

    $$\pi^*(s) = \arg\max_{\pi} V^{\pi}_N(s) \qquad (3)$$


Despite the simple mathematical properties of finite horizon MDPs, in many tasks the reward is
accumulated over an infinite (or indefinite) sequence of time steps. We refer to such tasks as
infinite horizon problems. There are three principal classes of infinite horizon problems.

(a) Discounted problems. Here we introduce a discount factor $\gamma$ with $0 < \gamma < 1$. The reward
incurred at the $k$th transition is discounted by a factor $\gamma^k$. Then the value function over an
infinite number of time steps is given by

    $$V^{\pi}(s) = \mathbb{E}\Big[ \sum_{k=0}^{\infty} \gamma^k R(s_k, \pi(s_k), s_{k+1}) \;\Big|\; s_0 = s \Big] \qquad (4)$$

In our assumption, the one-step reward is bounded from above by some constant, say, $M$.
Therefore $V^{\pi}(s) \le M / (1 - \gamma)$, because the infinite sum of a decreasing geometric progression is
finite, for all policies $\pi$ in all situations.

(b) Stochastic shortest path problems. Here $\gamma = 1$, but we assume that there exists an additional
termination state. Once the Markov chain reaches the termination state, it remains there
without any further rewards. The rewards (costs) associated with the other states are negative.
In addition, the Markov chain is assumed to be such that termination is inevitable within a finite
number of steps, at least under an optimal policy. Thus, the problem is in effect a finite horizon
one, but the length of the horizon may be random. It can be shown that any discounted problem
can be converted to a stochastic shortest path problem.

(c) Average reward problems. Without the discount factor, the sum over an infinite sequence of
rewards may be infinite. However, it turns out that in many problems the average reward per time
step, given by

    $$\lim_{N \to \infty} \frac{1}{N} V^{\pi}_N(s) \qquad (5)$$

where $V^{\pi}_N(s)$ is the $N$-horizon value function of policy $\pi$ starting at state $s$, is well defined as a
limit and is finite.

The optimal value function $V^*(s)$ can be shown to satisfy the well-known Bellman equation

    $$V^*(s) = \max_{a \in A} \mathbb{E}\big[ R(s, a, s') + \gamma V^*(s') \big] \qquad (6)$$


1.2 Representations of MDPs 

Exact solutions to MDPs, such as value iteration [?], policy iteration [?], and linear programming [9],
involve a lookup table representation of the value function, in the sense that the whole vector $V(s)$ is
kept in memory, with one entry for each state $s$. The complexity of these algorithms is at least polynomial [?] in
the size of the state space $|S|$ as well as the size of the action space $|A|$. However, the order of the
polynomials is large enough that these exact algorithms are not efficient in practice, and the computational
requirements of large-scale MDPs remain overwhelming. In such problems a sub-optimal approximate
solution based on a compact representation of the MDP needs to be used.
Widely used compact representations include:

• Linear basis functions [?], which construct a low-dimensional vector space representation
of the value function.

• Kernel (instance) based methods [28], which represent the value function as a convex
combination of observed values in the simulation samples.

• Factored MDPs [6], which construct a representation of the state space using a vector of state
variables, and represent the transition models between state variables using a dynamic Bayesian
network.

• Hierarchical representations [?] of MDPs, which exploit the task structure and in which the actions
are temporally extended.

• Symbolic representations of MDPs, which express the state space as binary decision
diagrams (BDD) and algebraic decision diagrams (ADD) [?].

However, finding a good compact representation for a given reinforcement learning (RL) task
requires careful hand-coding by a human designer, which can be quite difficult and time consuming.
We further review recent developments in the automatic discovery of efficient representations in MDPs.
We elaborate on the problem of automatically constructing structured kernels for kernel-based RL, a
popular approach to learning non-parametric approximations of the value function. We provide
algorithms for exploring a space of kernel structures that are built compositionally from base kernels
using a context-free grammar, and greedy algorithms for searching over the structure space.
2 Solutions for a Lookup Table Representation

In this section, we review basic solutions to MDPs with a lookup table representation of the value
function.

There are two fundamental classes of exact solution methods for MDPs. The first approach is based
on iterative algorithms that use dynamic programming, whereas the second approach formulates an
MDP as a linear program. These exact solutions require perfect knowledge of the explicit models
of the reward structure and transition probabilities of the system, which may not be available.
Simulation methods based on Monte Carlo simulations instead require only sample transitions
$(s_t, a_t, r_t, s_{t+1})$ of the system.

The iterative algorithms typically employ the Bellman equation (6) to recursively relate the value
of the current state to the values of adjacent states. The form of the Bellman equation motivates the
introduction of two essential operators, also known as Bellman backup or dynamic programming backup
operators in the literature, that provide a convenient shorthand notation in expressions.

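For any vector $V = (V(1), \dots, V(|S|))$, we consider the vector $TV$ obtained by applying one
iteration of the right-hand side of the Bellman equation:

    $$(TV)(s) = \max_{a \in A} \sum_{s' \in S} P^a_{ss'} \big( R(s, a, s') + \gamma V(s') \big) \qquad (7)$$

and similarly, for any vector $V$ and any stationary policy $\pi$, we consider the vector $T_{\pi} V$ with
components

    $$(T_{\pi} V)(s) = \sum_{s' \in S} P^{\pi(s)}_{ss'} \big( R(s, \pi(s), s') + \gamma V(s') \big) \qquad (8)$$

Given a stationary policy $\pi$, we define the $|S| \times |S|$ matrix $P_{\pi}$ whose $(i, j)$ entry is
$P^{\pi(i)}_{ij}$. Then we can rewrite $T_{\pi} V$ in matrix form as

    $$T_{\pi} V = R_{\pi} + \gamma P_{\pi} V \qquad (9)$$

where

    $$R_{\pi}(s) = \sum_{s' \in S} P^{\pi(s)}_{ss'} R(s, \pi(s), s') \qquad (10)$$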


We denote by $T^k$ and $T^k_{\pi}$ the operators obtained by composing the mappings $T$ and $T_{\pi}$ with themselves
$k$ times, respectively. It can be shown [3] that the following properties hold for $T$ and $T_{\pi}$.

(a) The optimal value vector $V^*$ is the only solution to the equation $V = TV$.

(b) We have $\lim_{k \to \infty} T^k V = V^*$ for every vector $V$.

(c) A stationary policy $\pi$ is optimal if and only if $T_{\pi} V^* = T V^*$.

(d) For every vector $V$, we have $\lim_{k \to \infty} T^k_{\pi} V = V^{\pi}$, and $V^{\pi}$ is the only solution of the equation
$V = T_{\pi} V$.

(e) The operator $T$ is a contraction mapping with respect to a weighted maximum norm. That is,
there exist a vector $\rho$ of size $|S|$ with positive components and a positive scalar $\beta < 1$ such that

    $$\| TV - TV' \|_{\rho} \le \beta \, \| V - V' \|_{\rho} \qquad (11)$$

for all vectors $V$ and $V'$, where the weighted maximum norm is $\|V\|_{\rho} = \max_{s \in S} |V(s)| / \rho(s)$.

2.1 Value Iteration 

A principal method for calculating the optimal value $V^*$, called value iteration, is to generate a
sequence $T^k V$ starting from some vector $V$, since $\lim_{k \to \infty} T^k V = V^*$. The value functions so computed
are guaranteed to converge in the limit to the optimal value function. In the stochastic shortest path
and average reward problems, some additional assumptions are needed for convergence.

• Finite ($N$) horizon problems: the algorithm always converges in $N$ steps.

• Infinite horizon problems with discounted rewards: the algorithm always converges to the
unique optimal solution.

• Stochastic shortest path problems: the algorithm converges if there is a policy with positive
probability of termination after at most a finite number of time steps, regardless of the initial state.

• Average reward problems: the algorithm converges if every state can be reached from
every other state in a finite number of time steps with positive probability for some policy.


Algorithm 1 Value Iteration
1: Initialize $V_0$ arbitrarily for each state and set $t = 0$
2: repeat
3: $t = t + 1$
4: Compute $V_t = T V_{t-1}$
5: Compute the residual $e_t = \| V_t - V_{t-1} \|_{\infty}$
6: until $e_t < \epsilon$
7: return the greedy policy $\pi(s) = \arg\max_{a \in A} \sum_{s' \in S} P^a_{ss'} \big[ R(s, a, s') + \gamma V_t(s') \big]$


A commonly used stopping rule is to set $\epsilon = \epsilon' (1 - \gamma) / (2\gamma)$, which ensures that the resulting value function is
within $\epsilon' / 2$ of the optimal value function, and the resulting policy is $\epsilon'$-optimal [38].

The running time for each iteration of Algorithm 1 is $O(|A| \, |S|^2)$. The number of iterations until
convergence has been shown [22] to be polynomial in the size of the state space $|S|$ as well as the size of
the action space $|A|$, which in turn makes value iteration polynomial in time. However, the order of the
polynomials is nontrivial, so in practice value iteration is usually inefficient.
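
As an illustration of Algorithm 1, the following minimal Python sketch runs value iteration on a
hypothetical two-state, two-action MDP. The nested-list layout `P[a][s][s2]`, the function name
`value_iteration`, and the toy MDP itself are assumptions made for this example and do not come from
the methods cited above.

```python
def value_iteration(P, R, gamma, eps=1e-8):
    """Value iteration for a discounted MDP (illustrative sketch).

    P[a][s][s2] is the transition probability and R[a][s][s2] the reward
    for moving from s to s2 under action a (hypothetical layout).
    """
    n_actions, n_states = len(P), len(P[0])

    def backup(V, s, a):
        # One-step lookahead: sum_s' P^a_{ss'} (R(s,a,s') + gamma V(s'))
        return sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2])
                   for s2 in range(n_states))

    V = [0.0] * n_states
    while True:
        V_new = [max(backup(V, s, a) for a in range(n_actions))
                 for s in range(n_states)]
        residual = max(abs(x - y) for x, y in zip(V_new, V))  # max-norm
        V = V_new
        if residual < eps:
            break
    # Greedy policy with respect to the final value function.
    policy = [max(range(n_actions), key=lambda a: backup(V, s, a))
              for s in range(n_states)]
    return V, policy


# Toy MDP: action 0 stays, action 1 switches state; arriving in state 1 pays 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[[0.0, 1.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)
```

In this toy problem the agent should switch out of state 0 and then stay in state 1, so both states
have value $1 / (1 - \gamma) = 10$ for $\gamma = 0.9$.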

2.2 Policy Iteration 

Another widely used iterative algorithm is known as policy iteration [?]. At each iteration, the
decision maker first carries out a policy evaluation phase, in which the value function associated
with the current policy is computed, and then a policy improvement phase, in which a greedy attempt is
made to improve the current policy.

The basic policy iteration algorithm is described in Algorithm 2, where the policy evaluation step
involves solving a system of $|S|$ equations with $|S|$ unknowns. Let $\rho$ be the invariant distribution of a
Markov chain, let $N$ be the set of non-terminal states, and let $T = S - N$ be the set of zero-reward
termination states in stochastic shortest path problems.

Algorithm 2 Policy Iteration
1: Let $\pi_0$ be some random initial policy and $t = 0$
2: repeat
3: Policy Evaluation: compute $V^{\pi_t}$ as in equation (12)
4: Policy Improvement: $\pi_{t+1}(s) = \arg\max_{a \in A} \sum_{s'} P^a_{ss'} \big( R(s, a, s') + \gamma V^{\pi_t}(s') \big)$, for all $s \in S$
5: $t = t + 1$
6: until $\pi_t(s) = \pi_{t-1}(s)$, for all $s \in S$


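    $$V^{\pi}(N) = \big(I - P_{\pi}(N, N)\big)^{-1} \big( R_{\pi}(N) + P_{\pi}(N, T) R_{\pi}(T) \big) \qquad \text{Stochastic Shortest Path}$$

    $$V^{\pi} = (I - \gamma P_{\pi})^{-1} R_{\pi} \qquad \text{Discounted Reward} \qquad (12)$$

    $$V^{\pi} = (I - P_{\pi})^{-1} (R_{\pi} - \rho) \qquad \text{Average Reward}$$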

For each iteration, the policy evaluation phase can be performed in $O(|S|^3)$ arithmetic operations and
the policy improvement phase in $O(|A| \, |S|^2)$ operations. When the number of states is large, it is usually
preferable to carry out the policy evaluation phase by using iterative methods such as value iteration.
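
The two phases of policy iteration can be sketched for the discounted case as follows. This is a minimal
illustration under stated assumptions: the array layout `P[a, s, s2]`, the function name
`policy_iteration`, and the toy MDP are hypothetical, and policy evaluation uses the direct linear solve
$V^{\pi} = (I - \gamma P_{\pi})^{-1} R_{\pi}$ rather than an iterative method.

```python
import numpy as np


def policy_iteration(P, R, gamma):
    """Policy iteration for a discounted MDP (illustrative sketch).

    P and R have shape (n_actions, n_states, n_states); P[a, s, s2] is a
    transition probability and R[a, s, s2] the corresponding reward.
    """
    n_actions, n_states, _ = P.shape
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma P_pi) V = R_pi.
        P_pi = P[policy, idx]                        # shape (S, S)
        R_pi = (P_pi * R[policy, idx]).sum(axis=1)   # expected one-step reward
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy action with respect to V.
        Q = (P * (R + gamma * V)).sum(axis=2)        # shape (A, S)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy


# Toy MDP: action 0 stays, action 1 switches state; arriving in state 1 pays 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[[0.0, 1.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
V, policy = policy_iteration(P, R, gamma=0.9)
print(V, policy)
```

On this example the loop stops after two evaluation steps, which reflects the typical behaviour that
policy iteration needs few iterations, each of which is comparatively expensive.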
It can be shown that the policy iteration algorithm generates an improving sequence of policies and
terminates with an optimal policy. There are no theoretical guarantees on the number of iterations
required, yet policy iteration has been listed as one of the preferred solution methods for MDPs.

2.3 Linear Programming 

A third approach to solving MDPs exactly is based on linear programming [?]. The primal linear
program involves

    Variables:   $V(s), \ \forall s \in S$

    Minimize:    $\sum_{s \in S} \rho(s) V(s)$                                                  (13)

    Subject to:  $V(s) \ge \sum_{s' \in S} P^a_{ss'} \big( R(s, a, s') + \gamma V(s') \big), \ \forall s \in S, \ \forall a \in A$

where $\rho$ is known as the state relevance weight vector, whose elements are all positive. There are
$|A| \, |S|$ constraints and $|S|$ variables, with one constraint for each state $s$ and action $a$. Thus, MDPs can
be solved in polynomial time. A drawback of this algorithm is that it is typically slower than the
iterative dynamic programming methods.

2.4 Temporal Difference Learning 

In this subsection, we discuss an implementation of the Monte Carlo algorithm that incrementally
updates the value function $V(s)$ after each transition. We first express the value function as

    $$V^{\pi}(s_t) = \mathbb{E}\Big[ \sum_{m=0}^{\infty} \gamma^m g(s_{t+m}, s_{t+m+1}) \Big] = \mathbb{E}\big[ g(s_t, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) \big] \qquad (14)$$

The Robbins-Monro stochastic approximation method for solving the above expectation equation
takes the form

    $$V(s_t) := (1 - \alpha_t) V(s_t) + \alpha_t \big( g(s_t, s_{t+1}) + \gamma V(s_{t+1}) \big) = V(s_t) + \alpha_t d_t \qquad (15)$$

where $\alpha_t \in (0, 1)$ is the learning rate and $d_t = g(s_t, s_{t+1}) + \gamma V(s_{t+1}) - V(s_t)$ is called the temporal
difference (TD) [?], representing the difference between an estimate $g(s_t, s_{t+1}) + \gamma V(s_{t+1})$
of the value function based on the one-step-ahead simulated outcome of the current time step, and the
current estimate $V(s_t)$. Alternatively, we might fix a non-negative integer $L$ and take into account
the $(L + 1)$-step-ahead simulated outcome,

    $$V^{\pi}(s_t) = \mathbb{E}\Big[ \sum_{m=0}^{L} \gamma^m g(s_{t+m}, s_{t+m+1}) + \gamma^{L+1} V^{\pi}(s_{t+L+1}) \Big] \qquad (16)$$

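We cannot assume that one $L$ is better than another in the absence of any special knowledge. For the sake
of generality, we may form a weighted average of the multi-step Bellman equations (16) over all possible
$L$. We introduce a constant $\lambda < 1$, multiply Eq. (16) by $(1 - \lambda) \lambda^L$, and sum over all non-negative $L$.
We then have

    $$V^{\pi}(s_t) = (1 - \lambda) \, \mathbb{E}\Big[ \sum_{L=0}^{\infty} \lambda^L \Big( \sum_{m=0}^{L} \gamma^m g(s_{t+m}, s_{t+m+1}) + \gamma^{L+1} V^{\pi}(s_{t+L+1}) \Big) \Big]$$

    $$\phantom{V^{\pi}(s_t)} = \mathbb{E}\Big[ (1 - \lambda) \sum_{m=0}^{\infty} \gamma^m g(s_{t+m}, s_{t+m+1}) \sum_{L=m}^{\infty} \lambda^L + \sum_{L=0}^{\infty} \gamma^{L+1} (\lambda^L - \lambda^{L+1}) V^{\pi}(s_{t+L+1}) \Big]$$

    $$\phantom{V^{\pi}(s_t)} = \mathbb{E}\Big[ \sum_{m=0}^{\infty} \lambda^m \gamma^m d_{m+t} \Big] + V^{\pi}(s_t) \qquad (17)$$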

The resulting Robbins-Monro stochastic approximation method is then

    $$V(s_t) := V(s_t) + \alpha_t \sum_{m=t}^{\infty} \lambda^{m-t} \gamma^{m-t} d_m \qquad (18)$$

The above equation provides a family of algorithms, one for each $\lambda$, and is known as TD($\lambda$). The
choice of $\lambda$ reflects a trade-off between bias and variance in the Monte Carlo based approximation.
The general conclusion from [?] is that intermediate values of $\lambda$ seem to work best in practice.
Sutton [?] has shown that under TD(0), the temporal difference algorithm converges to the true
value function $V^{\pi}$. Dayan [?] extended this result to the case of general $\lambda$.
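
The one-step update $V(s_t) := V(s_t) + \alpha_t d_t$ can be sketched as below. This is only an
illustration under stated assumptions: the function name `td0`, the constant step size, and the
three-state deterministic cycle are hypothetical. A deterministic chain is used so that the result can be
checked against the exact fixed point; on a stochastic chain the estimates would fluctuate unless the
step size is decreased.

```python
def td0(next_state, reward, gamma, alpha, n_steps):
    """TD(0) policy evaluation along a single simulated trajectory (sketch).

    next_state[s] is the successor of s under the fixed policy and
    reward[s] is the one-step reward g(s, next_state[s]) (hypothetical setup).
    """
    V = [0.0] * len(next_state)
    s = 0
    for _ in range(n_steps):
        s2 = next_state[s]
        d = reward[s] + gamma * V[s2] - V[s]   # temporal difference d_t
        V[s] += alpha * d                      # V(s_t) := V(s_t) + alpha * d_t
        s = s2
    return V


# Deterministic cycle 0 -> 1 -> 2 -> 0 with reward 1 on the transition 2 -> 0.
gamma = 0.9
V = td0(next_state=[1, 2, 0], reward=[0.0, 0.0, 1.0],
        gamma=gamma, alpha=0.5, n_steps=3000)

# Exact values from V(2) = 1 + gamma^3 V(2), V(1) = gamma V(2), V(0) = gamma V(1).
v2 = 1.0 / (1.0 - gamma ** 3)
exact = [gamma * gamma * v2, gamma * v2, v2]
print(V, exact)
```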

A temporal difference based method for learning action values, called Q-learning, was introduced
by Watkins [?]. Q-learning directly updates estimates of the Q-factors associated with an optimal
policy, thereby avoiding the multiple policy evaluation phases of policy iteration. The following
learning rule for the action value function $Q(s, a)$ is used:

    $$Q_{t+1}(s, a) = (1 - \alpha_t) Q_t(s, a) + \alpha_t \big( g(s, a, s') + \gamma \max_{a' \in A} Q_t(s', a') \big) \qquad (19)$$

where $s'$ and $g(s, a, s')$ are generated from the pair $(s, a)$ by simulation, according to the transition
probabilities $P^a_{ss'}$. Q-learning is sometimes referred to as an off-policy learning algorithm,
since it estimates the optimal action value function $Q(s, a)$ while simulating the MDP using any
policy. If, during simulation, the sequence of states is generated with the greedy actions provided by
the currently available Q-factors, it is possible that certain profitable actions are never explored. In
practice, variants of Q-learning with parameters that control the degree of exploration are
introduced to ensure sufficient exploration during simulation.
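
A minimal tabular sketch of the learning rule (19) with epsilon-greedy exploration is given below. The
function names `q_learning` and `step`, the constant step size, the exploration rate, and the toy
simulator are all assumptions made for illustration; epsilon-greedy is one common exploration scheme,
not necessarily the variant used in the cited work.

```python
import random


def q_learning(step, n_states, n_actions, gamma,
               alpha=0.1, epsilon=0.3, n_steps=20000, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (sketch).

    step(s, a) is a hypothetical simulator returning (next_state, reward).
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)                      # explore
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])  # exploit
        s2, g = step(s, a)
        # Q(s,a) := (1 - alpha) Q(s,a) + alpha (g + gamma max_a' Q(s',a'))
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (g + gamma * max(Q[s2]))
        s = s2
    return Q


def step(s, a):
    # Toy simulator: action 0 stays, action 1 switches; arriving in state 1 pays 1.
    s2 = s if a == 0 else 1 - s
    return s2, (1.0 if s2 == 1 else 0.0)


Q = q_learning(step, n_states=2, n_actions=2, gamma=0.9)
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
print(Q, greedy)
```

Setting `epsilon=0` in this sketch reproduces the failure mode described above: with all Q-factors
initialized to zero, the agent would stay in state 0 forever and never discover the rewarding action.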
3 Compact Representation of Markov Decision Processes

The solutions described in the previous section require a lookup table representation of the value
function $V(s)$ with size $|S|$. In environments with a large discrete state space, or even with
continuous state spaces, the time complexity of the MDP solution algorithms makes them inefficient in
practice. In this section, we review a variety of compact representations for approximately solving
MDPs, including low-dimensional vector space representations obtained by constructing linear basis
functions [?], instance-based representations of the value function using kernels in Hilbert space [28],
factored representations [?], hierarchical representations [?], and symbolic representations such as
binary decision diagrams (BDD) and algebraic decision diagrams (ADD) [?]. All these approaches
depend crucially on a choice of low-dimensional compact representation of an MDP, and assume
that it is carefully provided by the human designer. The focus of this section is on approximation,
rather than automatic representation discovery.

3.1 Linear Value Function Approximation 

In this subsection, we consider the policy evaluation phase for a single stationary policy $\pi$. Thus
we suppress in our notation for the value functions the dependence on $\pi$. We approximate the value
function $V(s)$ with a linear architecture:

    $$\tilde{V}(s, w) = \phi(s)' w, \quad \forall s \in S \qquad (20)$$

where $w$ is a weight vector and $\phi(s)$ is a $|D|$-dimensional feature vector associated with state $s$.
That is, we represent the value function in the compact form $V \approx \tilde{V} = \Phi w$, where $\Phi$ is the $|S| \times |D|$
matrix that has as rows the feature vectors $\phi(s)$, $s \in S$. Thus, we want to approximate the value
function $V$ within the subspace spanned by the $|D|$ basis functions, each of which is a column of
$\Phi$. The rank of the matrix $\Phi$ is $|D|$. Let $\Pi$ be the projection operator onto this linear subspace, with
respect to the weighted Euclidean norm $\| \cdot \|_{\rho}$:

    $$\| V \|_{\rho} = \sqrt{ \sum_{s \in S} \rho(s) V(s)^2 } \qquad (21)$$

where $\rho$ is a vector of positive components. $\Pi V$ is the unique vector in the subspace that minimizes
$\| V - \Phi w \|_{\rho}^2$:

    $$\Pi V = \Phi w_V \qquad (22)$$

    $$w_V = \arg\min_{w} \| V - \Phi w \|_{\rho}^2 \qquad (23)$$

By setting the gradient of Eq. (23) to 0, we have

    $$\Pi = \Phi (\Phi' D_{\rho} \Phi)^{-1} \Phi' D_{\rho} \qquad (24)$$

where $D_{\rho}$ is the $|S| \times |S|$ diagonal matrix whose entries are $\rho(s)$. Now consider the Bellman backup
operator $T_{\pi}$ updating projected value functions,

    $$\Phi w = \Pi T_{\pi}(\Phi w)$$

    $$\Phi w = \Pi [ R_{\pi} + \gamma P_{\pi} \Phi w ] \qquad (25)$$

This equation is known as the projected Bellman equation. The solution $w^{\pi}$ of this equation
is the approximation to the value function $V^{\pi}$ in the subspace spanned by $\Phi$. It satisfies

    $$[ \Phi' D_{\rho} (I - \gamma P_{\pi}) \Phi ] \, w^{\pi} = \Phi' D_{\rho} R_{\pi}$$

    $$A w^{\pi} = b \qquad (26)$$

and can be solved by matrix inversion, $w^{\pi} = A^{-1} b$, or by other iterative algorithms. It can be shown
that both mappings $T_{\pi}$ and $\Pi T_{\pi}$ are contractions [26] with respect to the weighted Euclidean norm
$\| \cdot \|_{\rho}$, where $\rho$ is the steady-state probability vector of the Markov chain with transition probabilities
$P_{\pi}$. Analogous to value iteration, the so-called projected value iteration algorithm iteratively applies the
contraction operator $\Pi T_{\pi}$, starting with some arbitrary vector $w_0$:

    $$\Phi w_{t+1} = \Pi T_{\pi}(\Phi w_t) \qquad (27)$$
However, the projected value iteration algorithm is not practical when $|S|$ is large, since $T_{\pi}(\Phi w_t)$ is
of size $|S|$, and the steady-state probabilities $\rho$ are assumed to be known.

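An alternative is to solve equation (26) from simulation trajectories sampled from the Markov chain
associated with policy $\pi$. After collecting samples up to time $t$, we have

    $$A_t = \frac{1}{t+1} \sum_{k=0}^{t} \phi(s_k) \big( \phi(s_k) - \gamma \phi(s_{k+1}) \big)' \qquad (28)$$

    $$b_t = \frac{1}{t+1} \sum_{k=0}^{t} \phi(s_k) \, g(s_k, s_{k+1}) \qquad (29)$$

Given $A_t$ and $b_t$, one can construct a simulation-based solution

    $$w_t = A_t^{-1} b_t \qquad (30)$$

This is known as the least-squares temporal difference (LSTD) method.

Similar to the TD($\lambda$) method, we can introduce a constant $\lambda < 1$ and define

    $$A_t^{\lambda} = \frac{1}{t+1} \sum_{k=0}^{t} \phi(s_k) \sum_{m=k}^{t} (\gamma\lambda)^{m-k} \big( \phi(s_m) - \gamma \phi(s_{m+1}) \big)' \qquad (31)$$

    $$b_t^{\lambda} = \frac{1}{t+1} \sum_{k=0}^{t} \phi(s_k) \sum_{m=k}^{t} (\gamma\lambda)^{m-k} g(s_m, s_{m+1}) \qquad (32)$$

The corresponding matrix inversion solution $w_t = (A_t^{\lambda})^{-1} b_t^{\lambda}$ is called the LSTD($\lambda$) method.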
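
The LSTD estimate (with $\lambda = 0$) can be sketched as follows. The function name `lstd`, the
array-based interface, and the three-state cycle are assumptions for illustration. One-hot features are
used here so that the subspace contains the exact value function and the result can be compared with the
direct solution of $(I - \gamma P_{\pi}) V = R_{\pi}$; with fewer features than states, the output would
instead be the projected fixed point.

```python
import numpy as np


def lstd(features, states, rewards, gamma):
    """LSTD(0) weight estimate from one trajectory (illustrative sketch).

    features[s] is the feature vector phi(s); states is s_0, ..., s_n and
    rewards[k] is the one-step reward g(s_k, s_{k+1}).
    """
    d = features.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for k, g in enumerate(rewards):
        phi = features[states[k]]
        phi_next = features[states[k + 1]]
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * g
    n = len(rewards)
    # The 1/n normalization cancels in the solve; it is kept to mirror A_t, b_t.
    return np.linalg.solve(A / n, b / n)


# Deterministic cycle 0 -> 1 -> 2 -> 0 with reward 1 on the transition 2 -> 0.
gamma = 0.9
features = np.eye(3)                 # one-hot (tabular) features
states = [0, 1, 2, 0, 1, 2, 0]
rewards = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
w = lstd(features, states, rewards, gamma)

P_pi = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, np.array([0.0, 0.0, 1.0]))
print(w, V_exact)
```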

3.2 Factored Markov Decision Processes 

When some structural knowledge about the state space is available, one can construct a factored MDP
representation of the state space using a vector of state variables, and represent the transition models
between state variables using a dynamic Bayesian network. In this way, the value function can be
approximated by a linear combination of basis functions, where each basis function involves only
a small subset of the state variables. In particular, Guestrin et al. [?] proposed an algorithm that
generalizes the exact linear program using basis functions $\Phi$:

    Variables:   $w_1, \dots, w_{|V|}$

    Minimize:    $\sum_{s \in S} \rho(s) \sum_i w_i \phi_i(s)$                                        (33)

    Subject to:  $\sum_i w_i \phi_i(s) \ge \sum_{s' \in S} P^a_{ss'} \big( R(s, a, s') + \gamma \sum_i w_i \phi_i(s') \big), \ \forall s \in S, \ \forall a \in A$

where $\rho$ is known as the state relevance weight vector, whose elements are all positive. The number
of variables in the linear program has now been reduced from $|S|$ to $|V|$, the number of basis functions in
the subspace $V$. Without a factored representation of the state space, the number of constraints remains
$|S| \times |A|$. For factored MDPs, the number of constraints can be reduced exponentially by exploiting
conditional independence properties in the conditional probability tables of the dynamic Bayesian
network.

3.3 Kernel Based Reinforcement Learning 

In kernel-based reinforcement learning (KBRL) algorithms [28] [?], value functions are
approximated from a set of sample outcomes $\{(s_t, a_t, r_t, s_{t+1})\}_{t \ge 0}$. Specifically, KBRL approximates the
outcome of an action $a$ from a given state $s$ as the convex combination of sampled outcomes of
that action, weighted by a function of the distance between $s$ and the sampled states. The Bellman
backup operator is then represented by an operator $T_K$ on the samples:

    $$V(s) = T_K V(s) = \max_{a \in A} Q(s, a) \qquad (34)$$

    $$Q(s, a) = \sum_{t \in \{t : a_t = a\}} K_a(s_t, s) \big[ r_t + \gamma V(s_{t+1}) \big] \qquad (35)$$

where the summation is over the subset of indices $t$ for which $a_t = a$, and the kernel $K_a(s_t, s)$ is
normalized in the sense that for each state $s$ and action $a$, $\sum_{t \in \{t : a_t = a\}} K_a(s_t, s) = 1$.

Kernel-based reinforcement learning has several promising properties. First, the operator $T_K$ has
a unique fixed point. One can obtain an algorithm analogous to value iteration to solve the MDP by
iteratively applying $T_K$. Second, the fixed point of this operator converges in probability to the true
value function for the normalized Gaussian kernel

    $$K_a(s_t, s) \propto \exp\Big[ - \frac{d(s_t, s)^2}{2 \sigma^2} \Big] \qquad (36)$$

when the number of samples $N_t \to \infty$ and the bandwidth $\sigma \to 0$. Here $d(s_t, s)$
denotes the distance function. However, the time complexity of KBRL is superlinear in the number of
samples $N_t$, which makes it impractical when the sample size is large. To make it practical, Kveton [20]
employs an unsupervised learning method to cluster the simulation samples onto $k$ representative ones,
and is able to compute the optimal policy in $O(n)$ time, assuming $k \ll n$ is a constant regardless of $n$.
Another advantage of kernel-based methods is the straightforward incorporation of structural
knowledge of the state space by using a structured kernel [?], where the kernel $K_a(s_t, s)$ can be
decomposed into a product of base kernels.
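
A single evaluation of the kernel-weighted action value $Q(s, a)$ can be sketched as follows. This is an
illustration only: the function name `kbrl_q`, the scalar state space, the absolute-difference distance,
and the Gaussian form with a factor of 2 in the denominator are assumptions. The weights are normalized
explicitly so that they form a convex combination over the samples that used action `a`.

```python
import math


def kbrl_q(s, a, samples, V, gamma, sigma):
    """Kernel-weighted estimate of Q(s, a) from sampled transitions (sketch).

    samples is a list of (s_t, a_t, r_t, s_next) tuples with scalar states,
    and V is a callable returning the current value estimate of a state.
    """
    rows = [(st, r, sn) for (st, at, r, sn) in samples if at == a]
    # Unnormalized Gaussian weights based on the distance |s_t - s|.
    weights = [math.exp(-(st - s) ** 2 / (2 * sigma ** 2)) for st, _, _ in rows]
    total = sum(weights)  # assumed positive: some sample of action a is near s
    return sum(w / total * (r + gamma * V(sn))
               for w, (_, r, sn) in zip(weights, rows))


samples = [(0.0, 0, 0.0, 0.2),   # action 0 taken at s = 0.0, reward 0
           (1.0, 0, 2.0, 0.8),   # action 0 taken at s = 1.0, reward 2
           (0.5, 1, 5.0, 0.5)]   # action 1 taken at s = 0.5, reward 5
V = lambda state: 1.0            # constant value estimate for the example
q0 = kbrl_q(0.5, 0, samples, V, gamma=0.5, sigma=0.3)
q1 = kbrl_q(0.5, 1, samples, V, gamma=0.5, sigma=0.3)
print(q0, q1)
```

The query state 0.5 is equidistant from the two samples of action 0, so they receive equal weight and
`q0` is the average of their backed-up values. Applying `max` over actions and iterating the update
over the sampled next states gives the value-iteration-like procedure described above.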


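The kernel-based algorithm defined above requires knowledge of the metric function of the state
space. Alternatively, Gaussian Process Temporal Difference (GPTD) [?] learning offers a
Bayesian solution. Consider an episode in which a terminal state is reached at time step $T + 1$,
with $r_{T+1} = V(s_{T+1}) = 0$. We have a generative model for the value function at state $s_t$:

    $$V(s_t) = r_t + \gamma r_{t+1} + \dots + \gamma^{T-t} r_T - \epsilon_t \qquad (37)$$

with zero-mean Gaussian noise $\epsilon_t$. In matrix form, we have

    $$Z_T r_{1:T} = V_{1:T} + \epsilon_{1:T} \qquad (38)$$

    $$r_{1:T} = H_T V_{1:T} + \epsilon'_{1:T} \qquad (39)$$

where $\epsilon'_{1:T} = H_T \epsilon_{1:T}$ and

    $$Z_T = \begin{bmatrix} 1 & \gamma & \gamma^2 & \cdots & \gamma^{T-1} \\ 0 & 1 & \gamma & \cdots & \gamma^{T-2} \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}, \qquad H_T = Z_T^{-1} = \begin{bmatrix} 1 & -\gamma & 0 & \cdots & 0 \\ 0 & 1 & -\gamma & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -\gamma \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix} \qquad (40)$$

Assuming a state-wise noise model with $\epsilon_t \sim N(0, \sigma^2)$, we have $\epsilon'_{1:T} \sim N(0, \sigma^2 H_T H_T^{\top})$.

Since both the value prior and the noise are Gaussian, so is the posterior distribution of the value
conditioned on an observed sequence of rewards $r_{1:T} = \{r_t\}_{t=1:T}$. The joint distribution of
a test point $V(s^*)$ and the observed sequence is

    $$\begin{bmatrix} Z_T r_{1:T} \\ V(s^*) \end{bmatrix} \sim N\left( 0, \begin{bmatrix} K_T + \sigma^2 I & K_T(s^*) \\ K_T(s^*)^{\top} & K(s^*, s^*) \end{bmatrix} \right) \qquad (41)$$

where $K_T$ denotes the $T \times T$ matrix of the covariances evaluated at all pairs of observed states, and
$K_T(s^*)$ denotes the $T \times 1$ vector of the covariances evaluated at pairs of an observed state $s_t$ and the
test state $s^*$. The posterior mean and variance of the value at $s^*$ are given, respectively, by

    $$\hat{V}(s^*) = K_T(s^*)^{\top} (K_T + \sigma^2 I)^{-1} Z_T r_{1:T} \qquad (42)$$

    $$\mathrm{VAR}(V(s^*)) = K(s^*, s^*) - K_T(s^*)^{\top} (K_T + \sigma^2 I)^{-1} K_T(s^*) \qquad (43)$$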
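
The posterior computation can be sketched as below, treating the discounted returns $Z_T r_{1:T}$ as
noisy observations of the values at the visited states. Everything specific in this sketch is an
assumption made for illustration: the function names, scalar states, the squared-exponential covariance
with unit length scale, and a square $T \times T$ matrix $H_T$ whose last row encodes
$V(s_{T+1}) = 0$.

```python
import numpy as np


def discount_matrices(T, gamma):
    """Upper-triangular Z_T with entries gamma^(j-i), and its inverse H_T."""
    idx = np.arange(T)
    expo = (idx[None, :] - idx[:, None]).astype(float)
    Z = np.triu(gamma ** expo)
    H = np.eye(T) - gamma * np.eye(T, k=1)
    return Z, H


def rbf(x, y, length=1.0):
    """Squared-exponential covariance between two 1-D arrays of states."""
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / length ** 2)


def gp_value_posterior(states, rewards, s_star, gamma, noise_var):
    """Posterior mean and variance of V(s_star) given one episode (sketch)."""
    T = len(states)
    Z, _ = discount_matrices(T, gamma)
    y = Z @ rewards                          # discounted returns Z_T r_{1:T}
    K = rbf(states, states)
    k_star = rbf(states, np.array([s_star]))[:, 0]
    G = K + noise_var * np.eye(T)
    mean = k_star @ np.linalg.solve(G, y)
    var = 1.0 - k_star @ np.linalg.solve(G, k_star)  # K(s*, s*) = 1 for this kernel
    return mean, var


states = np.array([0.0, 1.0, 2.0])
rewards = np.array([0.0, 0.0, 1.0])
Z, H = discount_matrices(3, 0.9)
mean, var = gp_value_posterior(states, rewards, s_star=0.0,
                               gamma=0.9, noise_var=1e-8)
print(mean, var)
```

With almost no observation noise, the posterior mean at the first visited state reproduces its
discounted return $\gamma^2 \cdot 1 = 0.81$, and the posterior variance there is close to zero.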


3.4 Hierarchical Methods 

Another approach to solving MDPs with large state spaces is to treat them as a hierarchy of task
structures. In many cases, hierarchical solutions do not aim at providing an optimal value function
for an MDP, but focus on gaining efficiency in execution time and learning time. Hierarchical
learners are commonly structured as delegation behaviors. Feudal Q-learning [?] involves a
hierarchy of learning problems, with higher-level agents being masters and lower-level agents being
slaves. The highest-level agent receives rewards $r_t$ and states $s_t$ from the external environment. It
learns a mapping from states $s_t$ to some pre-defined intermediate commands, and feeds the lower-level
slaves commands and corresponding rewards for taking actions that satisfy the command. The
lower-level agents learn a mapping from commands and states to external actions $a_t$. However, the
set of intermediate commands and their associated reinforcement functions must be established in
advance of the learning. Similarly, by assuming that one can identify useful subgoals and define
subtasks that achieve these subgoals, the MAXQ algorithms [?], which decompose the target MDP into
a hierarchy of smaller MDPs, were proposed. Using the MAXQ decomposition, the value function
of the target MDP can be expressed as an additive combination of the value functions of the smaller
MDPs. To relax the restriction of a human-designed hierarchy, Mehta et al. [25] further introduced an
algorithm that can automatically discover the task hierarchy, given that the dynamic Bayesian networks
associated with the action and reward models are provided, as well as successful sample trajectories
following the optimal policy.

3.5 Symbolic Algorithms for Solving MDPs 

We briefly discuss symbolic algorithms in this subsection. The key idea of symbolic algorithms is
to compactly represent the MDP models (value function, transition probabilities, reward functions,
etc.) using decision diagrams, instead of using the lookup table representation. Similar to aggregation
methods, these decision diagram representations cluster the states that share similar values. Instead
of applying the Bellman operator to each state, it is sufficient to update a subset of states with similar
values as a whole, with just a single Bellman backup. This representation allows one to
describe a value function as a function of the variables describing the domain, and speeds up
value iteration based algorithms. However, these symbolic algorithms assume that the states in the MDP are
factored. That is, the state space $S$ is factored into a set of $d$ boolean state variables $s = \{s_1, \dots, s_d\}$.
Although any finite-valued non-boolean variable can be split into a number of boolean variables, this
often makes the new state space in the decision diagram representation larger than the original one
in the lookup table representation.
4 Representation Learning in Markov Decision Processes

In this section, we discuss methods for automatically constructing compact representations of MDPs.

4.1 Feature Generation through Automatic Basis Construction 

The policy evaluation phase can be viewed as solving systems of linear equation of the form Aw = b. 
The Krylov space method has long been among the most successful methods currently available for 
efficiently solving systems of linear equations. The fc-order Krylov subspace is the linear subspace 
spanned by the image of b under the first k — 1 powers of A, that is, 

Krylovk{A, b) = span{6, Ab, A^b,..., (44) 

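For an MDP, typically we set $b = R^\pi$. The computation of the Krylov basis can be significantly 
accelerated by a computational trick called the Schultz expansion, 

$$(I - A)^{-1} b = (I + A + A^2 + \ldots)\, b = \prod_{k=0}^{\infty} \left(I + A^{2^k}\right) b \qquad (45)$$

For example, we can compute the policy evaluation phase as follows: 

$$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi = \prod_{k=0}^{\infty} \left(I + (\gamma P^\pi)^{2^k}\right) R^\pi \qquad (46)$$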
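
As a minimal sketch of the two ideas above, the following NumPy code builds an orthonormalized Krylov basis and evaluates a truncated Schultz product. The function names `krylov_basis` and `schultz_solve`, the random 5-state chain, and the choice of 8 product terms are illustrative assumptions, not part of the original text.

```python
import numpy as np

def krylov_basis(A, b, k):
    """Orthonormal basis (as columns) of span{b, Ab, ..., A^(k-1) b}."""
    vectors = [b]
    for _ in range(k - 1):
        vectors.append(A @ vectors[-1])
    # Raw Krylov vectors quickly become nearly collinear, so orthonormalize.
    Q, _ = np.linalg.qr(np.column_stack(vectors))
    return Q

def schultz_solve(A, b, n_terms):
    """Approximate (I - A)^{-1} b by the first n_terms factors (I + A^(2^k))."""
    x = b.copy()
    power = A.copy()               # holds A^(2^k), starting with k = 0
    for _ in range(n_terms):
        x = x + power @ x          # apply the factor (I + A^(2^k))
        power = power @ power      # square to obtain A^(2^(k+1))
    return x

# Hypothetical 5-state chain under a fixed policy (random, for illustration only).
rng = np.random.default_rng(0)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)  # make every row sum to one
R = rng.random(5)
gamma = 0.9

V_exact = np.linalg.solve(np.eye(5) - gamma * P, R)
V_schultz = schultz_solve(gamma * P, R, n_terms=8)   # covers 2^8 = 256 Neumann terms
Phi = krylov_basis(P, R, k=3)
print(np.max(np.abs(V_exact - V_schultz)))
```

Each factor doubles the number of Neumann series terms covered, at the cost of one matrix squaring per factor.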


Another way to construct a basis automatically is based on the residual error of the current feature 
set. Formally, if $\Phi_k$ is the current set of basis functions and $w^{\Phi_k}$ is the corresponding 
weight vector, the Bellman error basis functions (BEBFs) add 
$\phi_{k+1} = R + \gamma P \Phi_k w^{\Phi_k} - \Phi_k w^{\Phi_k}$ as the next basis function. 
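
The BEBF update can be sketched as follows. This is a sketch under stated assumptions: the weights are obtained from the unweighted linear fixed-point equations, each new basis function is normalized, and the helper name `bebf_basis` and the random 6-state chain are hypothetical.

```python
import numpy as np

def bebf_basis(P, R, gamma, k, tol=1e-12):
    """Build up to k basis functions; each new one is the current Bellman error."""
    # With an empty basis the value estimate is zero, so the first Bellman error is R.
    Phi = (R / np.linalg.norm(R)).reshape(-1, 1)
    while Phi.shape[1] < k:
        # Linear fixed-point weights for the current basis (unweighted, an assumption).
        A = Phi.T @ (Phi - gamma * P @ Phi)
        w = np.linalg.solve(A, Phi.T @ R)
        V_hat = Phi @ w
        bellman_error = R + gamma * P @ V_hat - V_hat
        norm = np.linalg.norm(bellman_error)
        if norm < tol:             # the value function is already represented exactly
            break
        Phi = np.column_stack([Phi, bellman_error / norm])
    return Phi

# Hypothetical 6-state chain under a fixed policy (random, for illustration only).
rng = np.random.default_rng(1)
P = rng.random((6, 6))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(6)
Phi = bebf_basis(P, R, gamma=0.9, k=4)
print(Phi.shape)
```

Because the fixed-point weights make the Bellman error orthogonal to the current basis, the columns produced here are mutually orthogonal up to rounding.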

It has been shown that a basis $\Phi$ is not only useful for approximating value functions, but also 
induces a low-dimensional MDP. The induced approximate reward function and approximate 
transition function are defined as 

$$R_\Phi^\pi = (\Phi^T \Phi)^{-1} \Phi^T R^\pi \qquad (47)$$

$$P_\Phi^\pi = (\Phi^T \Phi)^{-1} \Phi^T P^\pi \Phi \qquad (48)$$

where $R_\Phi^\pi$ holds the coefficients of the projection of the reward function onto the column 
space of $\Phi$ with respect to $\|\cdot\|_2$. Similarly, $P_\Phi^\pi$ is the least-squares solution 
to the system $\Phi P_\Phi^\pi \approx P^\pi \Phi$. The exact solution to 
this approximate MDP is the same as that given by the exact solution to the original MDP projected 
onto the basis $\Phi$. 
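
One concrete reading of this construction can be checked numerically. The sketch below assumes the unweighted least-squares definitions $R_\Phi = (\Phi^T \Phi)^{-1} \Phi^T R$ and $P_\Phi = (\Phi^T \Phi)^{-1} \Phi^T P \Phi$, then compares the weights obtained by solving the induced model with those of the linear fixed-point equations $\Phi^T (\Phi - \gamma P \Phi) w = \Phi^T R$. The helper name `induced_model` and the random features are hypothetical.

```python
import numpy as np

def induced_model(Phi, P, R):
    """Least-squares reward and transition models in the coordinates of Phi."""
    proj = np.linalg.solve(Phi.T @ Phi, Phi.T)   # (Phi^T Phi)^{-1} Phi^T
    return proj @ R, proj @ P @ Phi

# Hypothetical 6-state chain and 3 random features (for illustration only).
rng = np.random.default_rng(2)
n, k, gamma = 6, 3, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)
Phi = rng.random((n, k))

R_phi, P_phi = induced_model(Phi, P, R)

# Exact solution of the small induced MDP.
w_model = np.linalg.solve(np.eye(k) - gamma * P_phi, R_phi)

# Linear fixed-point solution computed directly in the original state space.
w_fixed_point = np.linalg.solve(Phi.T @ (Phi - gamma * P @ Phi), Phi.T @ R)

print(np.max(np.abs(w_model - w_fixed_point)))
```

The two weight vectors agree up to rounding, since multiplying the induced-model equation by $\Phi^T \Phi$ yields the fixed-point equation.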

Given a basis with $k$ basis functions constructed by the Krylov space or BEBF methods, Mahadevan [21] 
proposes the representation policy iteration algorithm, described in Algorithm 3. 


Algorithm 3 Model-based representation policy iteration 
 1: Let $\pi_0$ be an arbitrary policy and $t = 0$ 
 2: repeat 
 3:    Construct basis matrix $\Phi$ 
 4:    From the MDP compute $R_\Phi^{\pi_t}$ and $P_\Phi^{\pi_t}$ 
 5:    Find the solution to $(I - \gamma P_\Phi^{\pi_t}) w_\Phi = R_\Phi^{\pi_t}$ 
 6:    Project the solution back to the original state space, $\hat{V}^{\pi_t} = \Phi w_\Phi$ 
 7:    Find the greedy policy $\pi_{t+1}$ as in the policy improvement phase 

       $$\pi_{t+1}(s) = \arg\max_{a \in A} \sum_{s'} P_{ss'}^{a} \left(R_{ss'}^{a} + \gamma \hat{V}^{\pi_t}(s')\right) \qquad (49)$$

 8:    $t = t + 1$ 
 9: until $\pi_t = \pi_{t-1}$ 
10: return $\pi_t$ 
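
A compact sketch of the loop in Algorithm 3 is given below, assuming a Krylov basis rebuilt for each policy, expected rewards stored per state-action pair, and an iteration cap in case a small basis makes the policy oscillate. The function names, array layout, and random MDP are hypothetical choices made for illustration.

```python
import numpy as np

def krylov_basis(P_pi, R_pi, k):
    """Orthonormal basis of span{R, PR, ..., P^(k-1) R} for the current policy."""
    cols = [R_pi]
    for _ in range(k - 1):
        cols.append(P_pi @ cols[-1])
    Q, _ = np.linalg.qr(np.column_stack(cols))
    return Q

def representation_policy_iteration(P, R, gamma, k, max_iters=100):
    """P has shape (A, n, n); R has shape (A, n) and holds expected rewards."""
    n_states = P.shape[1]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        P_pi = P[policy, states]               # row s is P[policy[s], s, :]
        R_pi = R[policy, states]
        Phi = krylov_basis(P_pi, R_pi, k)
        proj = np.linalg.solve(Phi.T @ Phi, Phi.T)
        R_phi, P_phi = proj @ R_pi, proj @ P_pi @ Phi
        w = np.linalg.solve(np.eye(k) - gamma * P_phi, R_phi)
        V_hat = Phi @ w                        # project back to the state space
        new_policy = (R + gamma * P @ V_hat).argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V_hat

# Hypothetical MDP with 6 states and 2 actions (random, for illustration only).
rng = np.random.default_rng(3)
P = rng.random((2, 6, 6))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((2, 6))
gamma = 0.9

policy_full, V_full = representation_policy_iteration(P, R, gamma, k=6)
policy_small, V_small = representation_policy_iteration(P, R, gamma, k=3)
print(policy_full, policy_small)
```

With `k` equal to the number of states the basis spans the whole space, so the loop reduces to exact policy iteration; smaller `k` trades accuracy for a smaller linear system.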
4.2 Feature Generation through Adaptive State Aggregation 

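Another basis construction algorithm, called adaptive state aggregation, partitions the original 
state space $S$ into a set of $m$ subsets $S_1, \ldots, S_m$, where $\bigcup_{i=1}^{m} S_i = S$ and 
$S_i \cap S_j = \emptyset$ for $i \neq j$. 
We can view state aggregation as a special form of the basis matrix $\Phi$, in which each column is the 
indicator function of one cluster. At each iteration, the algorithm first carries out a regular value 
iteration step to compute $\tilde{V}^{k} = T(V^{k})$, where $T$ is the Bellman operator, and then 
corrects, rather than projects, $\tilde{V}^{k}$ using the basis matrix 

$$V^{k+1} = \tilde{V}^{k} + \Phi w^{k} \qquad (50)$$

where $w^{k}$ is the solution to the compact policy evaluation problem 

$$w^{k} = (I - \gamma P_\Phi)^{-1} R_\Phi \qquad (51)$$

$$P_\Phi = (\Phi^T \Phi)^{-1} \Phi^T P \Phi \qquad (52)$$

$$R_\Phi = (\Phi^T \Phi)^{-1} \Phi^T \left(T(\tilde{V}^{k}) - \tilde{V}^{k}\right) \qquad (53)$$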
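
The correction step can be illustrated with a short sketch. It assumes a fixed policy, so that the Bellman operator is $T(V) = R + \gamma P V$, takes the correction target to be the residual $T(\tilde{V}) - \tilde{V}$ of the value iteration result $\tilde{V} = T(V)$, and forms clusters by binning that residual into equal-width intervals. The helper names and the binning rule are assumptions made for illustration.

```python
import numpy as np

def bellman(P, R, gamma, V):
    """Bellman operator for a fixed policy: T(V) = R + gamma P V."""
    return R + gamma * P @ V

def aggregation_step(P, R, gamma, V, labels):
    """One value iteration sweep followed by an aggregation correction."""
    n = len(V)
    Phi = np.zeros((n, labels.max() + 1))
    Phi[np.arange(n), labels] = 1.0            # column j indicates cluster j
    Phi = Phi[:, Phi.sum(axis=0) > 0]          # drop empty clusters
    V_tilde = bellman(P, R, gamma, V)
    residual = bellman(P, R, gamma, V_tilde) - V_tilde
    proj = np.linalg.solve(Phi.T @ Phi, Phi.T)
    P_phi, R_phi = proj @ P @ Phi, proj @ residual
    w = np.linalg.solve(np.eye(Phi.shape[1]) - gamma * P_phi, R_phi)
    return V_tilde + Phi @ w                   # correct rather than project

def residual_clusters(P, R, gamma, V, m):
    """Group states whose residuals fall into the same of m equal-width bins."""
    V_tilde = bellman(P, R, gamma, V)
    r = bellman(P, R, gamma, V_tilde) - V_tilde
    edges = np.linspace(r.min(), r.max(), m + 1)
    return np.clip(np.digitize(r, edges[1:-1]), 0, m - 1)

# Hypothetical 8-state chain under a fixed policy (random, for illustration only).
rng = np.random.default_rng(4)
n, gamma = 8, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)

V = np.zeros(n)
for _ in range(20):
    labels = residual_clusters(P, R, gamma, V, m=3)
    V = aggregation_step(P, R, gamma, V, labels)
V_star = np.linalg.solve(np.eye(n) - gamma * P, R)
print(np.max(np.abs(V - V_star)))
```

When the remaining error is constant within each cluster, a single correction recovers the exact value function, which is why clustering states with similar residuals is a natural heuristic here.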

To create the basis $\Phi$ automatically, Keller proposed using neighborhood component analysis 
(NCA), a supervised learning algorithm, with the state $s$ as the input attributes and the Bellman 
error or the temporal difference error as the supervised signal. In this way, basis functions are 
placed in the lower-dimensional space learned by NCA. The new lower-dimensional features are then 
added as new features for the linear function approximator. 

4.3 Structure Learning in Factored MDPs 


Algorithm 4 Structure learning algorithm for factored MDPs 
1: Initialization 
2: for each time step $t$ do 
3:    Given $s$ and $\pi_{t-1}(s)$, observe $s'$ and $r$ 
4:    Update the factored representations of the reward, $\mathrm{Fact}(R_t)$, and transition, $\mathrm{Fact}(P_t)$, functions 
5:    Learn a policy $\pi_t$ using structured value iteration or other algorithms for factored MDPs 
6: end for 


Factored MDPs compactly represent the transition and reward functions of an MDP using 
dynamic Bayesian networks (DBNs). Efficient algorithms based on linear programming have been 
developed that remain tractable even when the state space is large. However, they require complete 
knowledge of the transition and reward functions of the problem in advance. Structure learning 
algorithms, as sketched in Algorithm 4, have been proposed to learn these functions from simulation 
trials, using decision tree induction to learn a factored representation of the reward and 
transition functions. Given the sample transitions $\{s_t, a_t, r_t, s_{t+1}\}$ observed in an MDP, 
decision tree induction algorithms learn a compact reward model with $\{s_t\}$ as example attributes 
and $\{r_t\}$ as example labels, and learn a conditional probability table representation of the 
transition model with $\{s_t\}$ as example attributes and $\{s_{t+1}\}$ as example labels. A 
statistical test is used to detect independence between two random variables. After a factored 
representation of the model has been learned incrementally, an improved policy is obtained by an 
incremental version of structured value iteration. At the next iteration, the agent follows an 
$\epsilon$-greedy variant of the updated policy and generates new simulation samples, and the 
algorithm again updates its factored representation of the model. 
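
The role of the independence test can be illustrated with a deliberately simplified sketch. Instead of decision tree induction, it tests every pair of a current and a next-state boolean variable with a Pearson chi-squared statistic and then estimates a conditional probability table from counts. The chi-squared statistic, the threshold 10.83 (roughly the 0.999 quantile for one degree of freedom), the helper names, and the synthetic transitions are all assumptions made for illustration.

```python
import numpy as np

def chi2_statistic(x, y):
    """Pearson chi-squared statistic for two binary (0/1) integer samples."""
    table = np.zeros((2, 2))
    np.add.at(table, (x, y), 1)                # 2x2 contingency counts
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    mask = expected > 0
    return float((((table - expected) ** 2)[mask] / expected[mask]).sum())

def learn_parents(S, S_next, threshold=10.83):
    """For each next-state variable, keep the current variables it depends on."""
    d = S.shape[1]
    return {j: [i for i in range(d)
                if chi2_statistic(S[:, i], S_next[:, j]) > threshold]
            for j in range(d)}

def estimate_cpt(S, S_next, j, parents):
    """Estimate P(s'_j = 1 | parent values) from transition counts."""
    counts = {}
    for row, target in zip(S[:, parents], S_next[:, j]):
        key = tuple(int(v) for v in row)
        ones, total = counts.get(key, (0, 0))
        counts[key] = (ones + int(target), total + 1)
    return {key: ones / total for key, (ones, total) in counts.items()}

# Synthetic transitions over three boolean state variables (illustration only).
rng = np.random.default_rng(0)
N = 2000
S = rng.integers(0, 2, size=(N, 3))
flip = (rng.random((N, 2)) < 0.1).astype(int)
S_next = np.column_stack([
    S[:, 0] ^ flip[:, 0],          # s'_0 copies s_0, flipped with probability 0.1
    S[:, 0] ^ flip[:, 1],          # s'_1 also depends only on s_0
    rng.integers(0, 2, size=N),    # s'_2 is an independent coin flip
])

parents = learn_parents(S, S_next)
cpt = estimate_cpt(S, S_next, j=1, parents=parents[1])
print(parents, cpt)
```

A tree-based learner would apply the same kind of test when deciding whether to split a node on a given state variable.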

4.4 Structure Discovery through Compositional Kernel Search 

Unlike parametric linear function approximation using a basis $\Phi$, kernel-based reinforcement 
learning (KBRL) [28] [33] is a popular approach to learning a non-parametric representation of 
the value function, in which the similarity between two states is captured by a kernel $K_a(s, s')$. 
In problems where the state space is factored, so that $s$ can be expressed as a set of state 
variables among which some conditional independencies hold, structured kernels [21] should be used 
to capture these independence relationships. When the conditional independencies between the state 
variables are not known in advance, kernel learning techniques need to be employed. By defining a 
space of kernel structures that are built compositionally from a context-free grammar, we propose 
a greedy search algorithm, based on previous work, that searches over the grammar and 
automatically chooses the decomposition structure from raw data while evaluating only a small 
fraction of all structures. We plan to demonstrate how well the learned structure represents and 
approximates the original RL problem in terms of compactness and efficiency, to evaluate our method 
on a synthetic problem, and to compare it to other RL baselines. 
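
To make the idea of a greedy search over a kernel grammar concrete, here is a minimal sketch. It is not the proposed algorithm: it assumes squared-exponential base kernels on single state variables, the two production rules S -> S + B and S -> S * B, fixed hyperparameters, and a Gaussian process negative log marginal likelihood as the score. All names and the synthetic regression data are hypothetical.

```python
import numpy as np

def se_kernel(X, dim, lengthscale=1.0):
    """Squared-exponential base kernel acting on one state variable."""
    d = X[:, dim][:, None] - X[:, dim][None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def evaluate(expr, X):
    """Expressions are ("SE", dim) or (op, left, right) with op in {"+", "*"}."""
    if expr[0] == "SE":
        return se_kernel(X, expr[1])
    op, left, right = expr
    K_left, K_right = evaluate(left, X), evaluate(right, X)
    return K_left + K_right if op == "+" else K_left * K_right

def neg_log_marginal_likelihood(K, y, noise=0.1):
    """GP score with fixed noise level; lower is better."""
    n = len(y)
    L = np.linalg.cholesky(K + noise ** 2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)

def expand(expr, n_dims):
    """Apply the grammar rules S -> S + B and S -> S * B to one expression."""
    return [(op, expr, ("SE", d)) for op in ("+", "*") for d in range(n_dims)]

def greedy_search(X, y, max_depth=3):
    """Keep the best candidate at each depth and expand only that one."""
    n_dims = X.shape[1]
    candidates = [("SE", d) for d in range(n_dims)]
    best, best_score, history = None, np.inf, []
    for _ in range(max_depth):
        scored = [(neg_log_marginal_likelihood(evaluate(c, X), y), c)
                  for c in candidates]
        score, expr = min(scored, key=lambda item: item[0])
        if score >= best_score:    # no candidate improves on the current structure
            break
        best, best_score = expr, score
        history.append(score)
        candidates = expand(best, n_dims)
    return best, history

# Synthetic targets with additive structure over two variables (illustration only).
rng = np.random.default_rng(5)
X = rng.uniform(-2.0, 2.0, size=(40, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.05 * rng.standard_normal(40)

best, history = greedy_search(X, y)
print(best, history)
```

Only one expression is expanded per depth, so the number of evaluated structures grows linearly with depth rather than exponentially.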
5 Related Work and Future Challenges 


The representation learning methods described in this report can be applied to build representations 
from sampled examples over a large variety of problems in AI. They are also closely related to recent 
work on manifold learning [34] [1] and spectral learning [21], which have largely been applied to 
nonlinear dimensionality reduction and semi-supervised learning problems on graphs. However, 
learning a compact MDP representation introduces new challenges not present in supervised 
learning and dimensionality reduction, as the set of training examples is not available as a batch but 
must be collected through active exploration of the state space. Another challenge for representation 
learning in reinforcement learning is how well a compact representation transfers from one problem 
to another. 